This project dives into an in-depth analysis of diabetes-related hospital admissions in the US, spanning a decade from 1999 to 2008. The core focus is on evaluating and understanding the patterns and trends in patient readmissions. Through a detailed examination of a comprehensive dataset, we shed light on various elements, including patient demographics, hospital attributes, treatment efficacy, and, crucially, the rates and determinants of readmission among diabetes patients.
The analysis was carried out in Python, using Pandas for data manipulation and Matplotlib and Seaborn for visualization. Our dataset, obtained from the US Department of Health, encompasses over 100,000 hospital admission records. Models from scikit-learn were used throughout: Logistic Regression, Random Forest, and Decision Tree classifiers for the classification tasks, and linear models for the regression tasks.
This study brings to light the critical challenges in managing diabetes in the US healthcare context. It underscores the urgent need for healthcare policy reforms targeted at reducing the frequency of readmissions. Our findings advocate for enhanced resource allocation toward preventive care and effective diabetes management strategies.
import numpy as np
import pandas as pd
# For machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, mean_squared_error
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
#For plots
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from matplotlib import rcParams
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
data = pd.read_csv('diabetic_data.csv', delimiter=',')
ids = pd.read_csv('IDS_mapping.csv', delimiter=',', header=None)  # header=None: the file stacks several mapping tables, each with its own header row
data.head()
| encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ? | Pediatrics-Endocrinology | 41 | 0 | 1 | 0 | 0 | 0 | 250.83 | ? | ? | 1 | NaN | NaN | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ? | ? | 59 | 0 | 18 | 0 | 0 | 0 | 276 | 250.01 | 255 | 9 | NaN | NaN | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ? | ? | 11 | 5 | 13 | 2 | 0 | 1 | 648 | 250 | V27 | 6 | NaN | NaN | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ? | ? | 44 | 1 | 16 | 0 | 0 | 0 | 8 | 250.43 | 403 | 7 | NaN | NaN | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ? | ? | 51 | 0 | 8 | 0 | 0 | 0 | 197 | 157 | 250 | 5 | NaN | NaN | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | Ch | Yes | NO |
print("Number of rows:", data.shape[0])
print("Number of columns:", data.shape[1])
Number of rows: 101766
Number of columns: 50
ids.head()
| 0 | 1 | |
|---|---|---|
| 0 | admission_type_id | description |
| 1 | 1 | Emergency |
| 2 | 2 | Urgent |
| 3 | 3 | Elective |
| 4 | 4 | Newborn |
print("Number of rows:", ids.shape[0])
print("Number of columns:", ids.shape[1])
Number of rows: 68
Number of columns: 2
# Assessing missing values
missing_values = data.replace('?', pd.NA).isna().sum()
# Percentage of missing values
missing_percentage = (missing_values / len(data)) * 100
missing_summary = pd.DataFrame({'Number of Missing Values': missing_values, 'Percentage': missing_percentage})
missing_summary.sort_values(by='Percentage', ascending=False)
| Number of Missing Values | Percentage | |
|---|---|---|
| weight | 98569 | 96.858479 |
| max_glu_serum | 96420 | 94.746772 |
| A1Cresult | 84748 | 83.277322 |
| medical_specialty | 49949 | 49.082208 |
| payer_code | 40256 | 39.557416 |
| race | 2273 | 2.233555 |
| diag_3 | 1423 | 1.398306 |
| diag_2 | 358 | 0.351787 |
| diag_1 | 21 | 0.020636 |
| encounter_id | 0 | 0.000000 |
| troglitazone | 0 | 0.000000 |
| tolbutamide | 0 | 0.000000 |
| pioglitazone | 0 | 0.000000 |
| rosiglitazone | 0 | 0.000000 |
| acarbose | 0 | 0.000000 |
| miglitol | 0 | 0.000000 |
| citoglipton | 0 | 0.000000 |
| tolazamide | 0 | 0.000000 |
| examide | 0 | 0.000000 |
| glipizide | 0 | 0.000000 |
| insulin | 0 | 0.000000 |
| glyburide-metformin | 0 | 0.000000 |
| glipizide-metformin | 0 | 0.000000 |
| glimepiride-pioglitazone | 0 | 0.000000 |
| metformin-rosiglitazone | 0 | 0.000000 |
| metformin-pioglitazone | 0 | 0.000000 |
| change | 0 | 0.000000 |
| diabetesMed | 0 | 0.000000 |
| glyburide | 0 | 0.000000 |
| repaglinide | 0 | 0.000000 |
| acetohexamide | 0 | 0.000000 |
| glimepiride | 0 | 0.000000 |
| gender | 0 | 0.000000 |
| age | 0 | 0.000000 |
| admission_type_id | 0 | 0.000000 |
| discharge_disposition_id | 0 | 0.000000 |
| admission_source_id | 0 | 0.000000 |
| time_in_hospital | 0 | 0.000000 |
| num_lab_procedures | 0 | 0.000000 |
| num_procedures | 0 | 0.000000 |
| num_medications | 0 | 0.000000 |
| number_outpatient | 0 | 0.000000 |
| number_emergency | 0 | 0.000000 |
| number_inpatient | 0 | 0.000000 |
| number_diagnoses | 0 | 0.000000 |
| metformin | 0 | 0.000000 |
| patient_nbr | 0 | 0.000000 |
| nateglinide | 0 | 0.000000 |
| chlorpropamide | 0 | 0.000000 |
| readmitted | 0 | 0.000000 |
- Weight: clinically relevant, but missing for ~97% of records, so we drop it.
- Medical specialty: identifier for the admitting physician's specialty, with 84 distinct values (e.g. cardiology, internal medicine, family/general practice, surgery); ~49% missing, so we drop it.
- Payer code: identifier with 23 distinct values (e.g. Blue Cross/Blue Shield, Medicare, self-pay); not useful for predicting readmission, so we drop it.
- Max glu serum: range of the glucose serum test result, or none if the test was not taken. Values: >200, >300, normal, none. Missing for ~95% of records.
- A1Cresult: range of the HbA1c result: >8 if the result was greater than 8%, >7 if between 7% and 8%, normal if below 7%, none if not measured. Missing for ~83% of records.
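A hedged alternative worth noting before dropping max_glu_serum and A1Cresult: since "not measured" is itself informative (the test was or was not ordered), the column could be kept as an ordinal feature. A minimal sketch — the ordering and the toy values here are illustrative assumptions, not part of the pipeline below:

```python
import pandas as pd

# Hypothetical alternative (not applied in this notebook): encode
# "not measured" as its own ordinal level instead of dropping A1Cresult.
a1c = pd.Series(['>8', '>7', 'Norm', 'None', '>8'])
a1c_order = {'None': 0, 'Norm': 1, '>7': 2, '>8': 3}  # assumed ordering
a1c_encoded = a1c.map(a1c_order)
```

This preserves the "test ordered at all" signal at the cost of imposing an ordinal scale on the categories.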
data.replace('?', pd.NA, inplace=True)
columns_to_drop = ['weight', 'medical_specialty', 'payer_code','max_glu_serum','A1Cresult']
data = data.drop(columns=columns_to_drop)
cols = ['race', 'diag_1', 'diag_2', 'diag_3']
data.dropna(subset=cols, inplace=True)
nombre_redondant = data.duplicated(subset=['patient_nbr']).sum()
data = data.drop_duplicates(subset=['patient_nbr'])
print("Number of duplicated patient_nbr values:", nombre_redondant)
Number of duplicated patient_nbr values: 29423
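Note that `drop_duplicates` keeps the *first* row it sees for each `patient_nbr`. On a toy frame (values assumed for illustration), sorting by `encounter_id` first — assuming ids increase over time — would keep each patient's earliest encounter:

```python
import pandas as pd

# drop_duplicates keeps the first row per key; sorting beforehand
# controls *which* encounter survives for each patient.
toy = pd.DataFrame({'patient_nbr': [1, 1, 2],
                    'encounter_id': [20, 10, 30]})
first_visits = (toy.sort_values('encounter_id')
                   .drop_duplicates(subset=['patient_nbr']))
```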
columns_to_drop = ['encounter_id', 'patient_nbr', 'payer_code']
#We drop the columns only if they exist in the DataFrame
data = data.drop([col for col in columns_to_drop if col in data.columns], axis=1)
We check how many records have an undeclared gender before deciding how to handle them. Only one record has an 'Unknown/Invalid' gender, so we simply drop it.
Drop the rows with unknown gender
nombre_unknown = (data['gender'] == 'Unknown/Invalid').sum()
data = data[data['gender'] != 'Unknown/Invalid']
print("Number of 'Unknown/Invalid' values in the 'gender' column:", nombre_unknown)
Number of 'Unknown/Invalid' values in the 'gender' column: 1
We now apply several ID-to-label mappings, based on IDS_mapping.csv.
#mapping admission type ids (1 and 2 are merged into Emergency; 5, 6, 8 are unknown/NaN)
mapping_admission_id = {1.0:"Emergency",
2.0:"Emergency",
3.0:"Elective",
4.0:"New Born",
5.0:np.nan,
6.0:np.nan,
7.0:"Trauma Center",
8.0:np.nan}
data.admission_type_id = data.admission_type_id.replace(mapping_admission_id)
plt.figure(figsize=(10, 6))
custom_palette = sns.color_palette("Reds", n_colors=len(data["admission_type_id"].unique()))
sns.countplot(x="admission_type_id", data=data, palette=custom_palette)
plt.xticks(rotation=90)
plt.show()
#mapping admission source id
mapping_admission_source_id = {1:"Referral",2:"Referral",3:"Referral",
4:"Other",5:"Other",6:"Other",10:"Other",22:"Other",25:"Other",
9:"Other",8:"Other",14:"Other",13:"Other",11:"Other",
15:np.nan,17:np.nan,20:np.nan,21:np.nan,
7:"Emergency"}
data.admission_source_id = data.admission_source_id.replace(mapping_admission_source_id)
plt.figure(figsize=(10, 6))
custom_palette = sns.color_palette("pastel", n_colors=len(data["admission_source_id"].unique()))
sns.countplot(x="admission_source_id", data=data, palette=custom_palette)
plt.xticks(rotation=90)
plt.show()
#mapping discharge disposition
mapping_discharge_disp= {1:"Referral",2:"Referral",3:"Referral",
4:"Other",5:"Other",6:"Other",10:"Other",22:"Other",25:"Other",
9:"Other",8:"Other",14:"Other",13:"Other",11:"Other",
15:np.nan,17:np.nan,20:np.nan,21:np.nan,
7:"Emergency"}
data.discharge_disposition_id = data.discharge_disposition_id.replace(mapping_discharge_disp)
plt.figure(figsize=(6, 6))
custom_palette = sns.color_palette("pastel", n_colors=len(data["discharge_disposition_id"].unique()))
sns.countplot(x="discharge_disposition_id", data=data, palette=custom_palette)
plt.xticks(rotation=90)
plt.show()
#mapping admission type id
# Note: admission_type_id was already mapped to labels above, so this
# integer-keyed replace is a no-op at this point; kept for reference.
mapping_admission_type_id= {1:"Emergency",2:"Urgent",3:"Elective",
4:"Newborn",5:"Not Available",6:"NULL",7:"Trauma Center",8:"Not Mapped"}
data.admission_type_id = data.admission_type_id.replace(mapping_admission_type_id)
# Convert diagnosis codes to numeric floats; non-numeric ICD-9 codes (e.g. 'V27') become NaN
data['diag_1'] = pd.to_numeric(data['diag_1'], errors='coerce')
data['diag_2'] = pd.to_numeric(data['diag_2'], errors='coerce')
data['diag_3'] = pd.to_numeric(data['diag_3'], errors='coerce')
cols = ['diag_1', 'diag_2', 'diag_3']
data.dropna(subset=cols, inplace=True)
(vii) Re-Writing Data
We rewrite some columns to obtain a simpler, better-adjusted dataset.
age_types = data['age'].unique()
print("Distinct values in the 'age' column:")
print(age_types)
Distinct values in the 'age' column:
['[10-20)' '[30-40)' '[40-50)' '[50-60)' '[70-80)' '[80-90)' '[90-100)' '[60-70)' '[20-30)' '[0-10)']
# Map each age bracket to its midpoint
replaceDict = {'[0-10)' : 5,
'[10-20)' : 15,
'[20-30)' : 25,
'[30-40)' : 35,
'[40-50)' : 45,
'[50-60)' : 55,
'[60-70)' : 65,
'[70-80)' : 75,
'[80-90)' : 85,
'[90-100)' : 95}
data['age'] = data['age'].apply(lambda x : replaceDict[x])
print(data['age'].head())
1    15
3    35
4    45
5    55
7    75
Name: age, dtype: int64
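Equivalent to the `apply()` above, `.map` with the dict leaves unseen labels as NaN instead of raising a KeyError, which is safer if new age brackets ever appear. A small sketch (toy values for illustration):

```python
import pandas as pd

# .map with a dict: unseen labels become NaN rather than raising.
ages = pd.Series(['[0-10)', '[40-50)', '[90-100)'])
midpoints = {'[0-10)': 5, '[40-50)': 45, '[90-100)': 95}
mapped = ages.map(midpoints)
```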
colonnes = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide',
'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone',
'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed']
for colonne in colonnes:
    print(colonne, ":", data[colonne].unique())
metformin : ['No' 'Steady' 'Up' 'Down']
repaglinide : ['No' 'Up' 'Steady' 'Down']
nateglinide : ['No' 'Steady' 'Down' 'Up']
chlorpropamide : ['No' 'Steady' 'Down' 'Up']
glimepiride : ['No' 'Steady' 'Down' 'Up']
acetohexamide : ['No' 'Steady']
glipizide : ['No' 'Steady' 'Up' 'Down']
glyburide : ['No' 'Steady' 'Up' 'Down']
tolbutamide : ['No' 'Steady']
pioglitazone : ['No' 'Steady' 'Up' 'Down']
rosiglitazone : ['No' 'Steady' 'Up' 'Down']
acarbose : ['No' 'Steady' 'Up']
miglitol : ['No' 'Steady' 'Down' 'Up']
troglitazone : ['No' 'Steady']
tolazamide : ['No' 'Steady']
examide : ['No']
citoglipton : ['No']
insulin : ['Up' 'Steady' 'No' 'Down']
glyburide-metformin : ['No' 'Steady' 'Down' 'Up']
glipizide-metformin : ['No' 'Steady']
glimepiride-pioglitazone : ['No']
metformin-rosiglitazone : ['No']
metformin-pioglitazone : ['No' 'Steady']
change : ['Ch' 'No']
diabetesMed : ['Yes' 'No']
(viii) Eliminating Data with Zero Variance
As shown above, some columns take a single value for every record (examide, citoglipton, glimepiride-pioglitazone, metformin-rosiglitazone), so they carry no information. We therefore remove columns with (near-)zero variance, i.e. columns that barely change across patients.
from sklearn.feature_selection import VarianceThreshold
encoded_colonnes = data[colonnes].copy()
#Replace values in the selected columns
replacement_dict = {'Up': 2, 'Steady': 1, 'No': -1, 'Down': -2,'Ch' : 1, 'Yes' : 1}
encoded_colonnes.replace(replacement_dict, inplace=True)
#Apply VarianceThreshold to remove near-zero variance columns
threshold = 0.05 # Set the threshold as desired (adjust as needed)
selector = VarianceThreshold(threshold=threshold)
filtered_colonnes = selector.fit_transform(encoded_colonnes)
selected_columns = encoded_colonnes.columns[selector.get_support()]
print("Columns we will keep:", list(selected_columns))
toDelete = list(set(colonnes) - set(selected_columns))
print("We will delete", len(toDelete), "columns!")
data = data.drop([col for col in toDelete if col in data.columns], axis=1)
Columns we will keep: ['metformin', 'repaglinide', 'glimepiride', 'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'insulin', 'change', 'diabetesMed']
We will delete 15 columns!
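A toy check of how `VarianceThreshold` behaves (array values assumed for illustration): a constant column has zero variance and is dropped, while a varying column survives the 0.05 threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column 0 is constant (variance 0); column 1 alternates (variance 1).
X = np.array([[1, -1], [1, 1], [1, -1], [1, 1]])
sel = VarianceThreshold(threshold=0.05)
sel.fit(X)
```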
(ix) Mapping Diag Columns
As described in the reference paper, each diag value is an ICD-9 code belonging to a disease group; for example, codes in the 390-459 range (such as 391) correspond to circulatory conditions.
def mapping_diag(data, col):
    conditions = [
        ((data[col] >= 390) & (data[col] < 460)) | (data[col] == 785),
        ((data[col] >= 460) & (data[col] < 520)) | (data[col] == 786),
        ((data[col] >= 520) & (data[col] < 580)) | (data[col] == 787),
        ((data[col] >= 250) & (data[col] < 251)),  # catches all 250.xx diabetes codes
        ((data[col] >= 800) & (data[col] <= 1000)),
        ((data[col] >= 710) & (data[col] <= 740)),
        ((data[col] >= 580) & (data[col] < 630)) | (data[col] == 788),
        ((data[col] >= 140) & (data[col] <= 240)),
        ((data[col] >= 630) & (data[col] <= 680))
    ]
    choices = [
        'Circulatory',
        'Respiratory',
        'Digestive',
        'Diabetes',
        'Injury',
        'Musculoskeletal',
        'Genitourinary',
        'Neoplasms',
        'Pregnancy'
    ]
    data[col] = np.select(conditions, choices, default='Other')
    return data
data=mapping_diag(data,"diag_1")
data=mapping_diag(data,"diag_2")
data=mapping_diag(data,"diag_3")
count_other = data[data['diag_1'] == 'Other'].shape[0]
print("Number of 'Other' occurrences in diag_1 column:", count_other)
count_other = data[data['diag_2'] == 'Other'].shape[0]
print("Number of 'Other' occurrences in diag_2 column:", count_other)
count_other = data[data['diag_3'] == 'Other'].shape[0]
print("Number of 'Other' occurrences in diag_3 column:", count_other)
Number of 'Other' occurrences in diag_1 column: 14269
Number of 'Other' occurrences in diag_2 column: 19067
Number of 'Other' occurrences in diag_3 column: 19186
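A quick spot-check of the range-based mapping on a few known ICD-9 codes (a reduced set of the conditions used in `mapping_diag`; the sample codes are chosen for illustration):

```python
import numpy as np
import pandas as pd

# np.select evaluates conditions in order; unmatched codes fall to 'Other'.
codes = pd.DataFrame({'diag': [250.0, 401.0, 486.0, 820.0]})
conditions = [
    (codes['diag'] >= 390) & (codes['diag'] < 460),    # circulatory
    (codes['diag'] >= 460) & (codes['diag'] < 520),    # respiratory
    (codes['diag'] >= 250) & (codes['diag'] < 251),    # diabetes (250.xx)
    (codes['diag'] >= 800) & (codes['diag'] <= 1000),  # injury
]
choices = ['Circulatory', 'Respiratory', 'Diabetes', 'Injury']
labels = np.select(conditions, choices, default='Other')
```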
(x) Categorical Data
Transforming categorical columns into a data_cleaned dataframe
data_cleaned = data.copy()  # explicit copy so edits to data_cleaned don't mutate data
#mapping admission type ids (no-op here: already mapped to labels above)
mapping_admission_id = {1.0:"Emergency",
2.0:"Emergency",
3.0:"Elective",
4.0:"New Born",
5.0:np.nan,
6.0:np.nan,
7.0:"Trauma Center",
8.0:np.nan}
data_cleaned.admission_type_id = data.admission_type_id.replace(mapping_admission_id)
#mapping admission source ids
mapping_admission_source_id = {1:"Referral",2:"Referral",3:"Referral",
4:"Other",5:"Other",6:"Other",10:"Other",22:"Other",25:"Other",
9:"Other",8:"Other",14:"Other",13:"Other",11:"Other",
15:np.nan,17:np.nan,20:np.nan,21:np.nan,
7:"Emergency"}
data_cleaned.admission_source_id = data_cleaned.admission_source_id.replace(mapping_admission_source_id)
#mapping discharge disposition ids
mapping_discharge_disp= {1:"Referral",2:"Referral",3:"Referral",
4:"Other",5:"Other",6:"Other",10:"Other",22:"Other",25:"Other",
9:"Other",8:"Other",14:"Other",13:"Other",11:"Other",
15:np.nan,17:np.nan,20:np.nan,21:np.nan,
7:"Emergency"}
data_cleaned.discharge_disposition_id = data_cleaned.discharge_disposition_id.replace(mapping_discharge_disp)
cols = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'diag_3']
data.dropna(subset=cols, inplace=True)
(xi) Encoding Data
Transforming non-numerical columns into a data_encoded dataframe
for colonne in data.columns:
    print(colonne, ":", data[colonne].unique())
race : ['Caucasian' 'AfricanAmerican' 'Other' 'Asian' 'Hispanic']
gender : ['Female' 'Male']
age : [15 35 45 55 75 85 95 65 25 5]
admission_type_id : ['Emergency' 'Elective' 'New Born' 'Trauma Center']
discharge_disposition_id : ['Referral' 'Other' 'Emergency' 18 12 23 28 24 19 27]
admission_source_id : ['Emergency' 'Referral' 'Other']
time_in_hospital : [ 3 2 1 5 13 12 9 7 10 4 6 11 8 14]
num_lab_procedures : [ 59 44 51 31 73 68 33 47 62 60 55 49 75 45 29 35 19 64 25 53 52 87 27 37 46 41 28 36 72 10 2 65 67 40 58 57 32 34 39 69 56 22 96 78 48 70 66 18 9 24 63 76 90 77 80 71 50 38 43 61 42 84 74 23 54 95 83 30 12 79 13 81 15 1 86 5 26 6 8 11 100 21 88 7 89 91 16 94 82 93 14 105 97 4 101 17 98 20 85 3 92 102 108 106 99 113 109 111 103 132 121 118]
num_procedures : [0 1 6 2 3 5 4]
num_medications : [18 16 8 12 28 17 11 15 31 2 13 23 7 20 14 10 22 9 27 19 25 21 6 30 24 33 29 4 40 5 46 41 36 26 43 42 34 51 35 3 45 37 32 38 1 54 49 39 52 47 44 48 53 62 50 56 61 59 57 55 58 70 67 63 64 60 69 65 66 68 81 79 75 72 74]
number_outpatient : [ 0 1 5 2 7 3 8 4 12 11 6 20 15 14 9 10 21 16 29 13 18 19 22 24 42 33 17 25 23]
number_emergency : [ 0 1 4 2 3 6 8 7 13 5 42 10 9 25 11 37 20 16]
number_inpatient : [ 0 2 1 3 5 4 6 8 11 7 9 10 12]
diag_1 : ['Other' 'Neoplasms' 'Circulatory' 'Respiratory' 'Injury' 'Genitourinary' 'Musculoskeletal' 'Digestive' 'Pregnancy' 'Diabetes']
diag_2 : ['Other' 'Neoplasms' 'Circulatory' 'Respiratory' 'Injury' 'Genitourinary' 'Digestive' 'Diabetes' 'Musculoskeletal' 'Pregnancy']
diag_3 : ['Other' 'Circulatory' 'Diabetes' 'Respiratory' 'Injury' 'Neoplasms' 'Genitourinary' 'Digestive' 'Musculoskeletal' 'Pregnancy']
number_diagnoses : [ 9 7 5 8 3 6 4 16 12 13 15 11 14 10]
metformin : ['No' 'Steady' 'Up' 'Down']
repaglinide : ['No' 'Up' 'Steady' 'Down']
glimepiride : ['No' 'Steady' 'Up' 'Down']
glipizide : ['No' 'Steady' 'Down' 'Up']
glyburide : ['No' 'Steady' 'Up' 'Down']
pioglitazone : ['No' 'Steady' 'Up' 'Down']
rosiglitazone : ['No' 'Steady' 'Up' 'Down']
insulin : ['Up' 'Steady' 'No' 'Down']
change : ['Ch' 'No']
diabetesMed : ['Yes' 'No']
readmitted : ['>30' 'NO' '<30']
data_encoded = pd.DataFrame()
def categorize_readmission(value):
    if value == 'NO':
        return 0
    else:
        return 1
#Apply the function to the 'readmitted' column and create a new column with the results
data_encoded['readmitted'] = data['readmitted'].apply(categorize_readmission)
#Categorical columns for one-hot encoding
categorical_cols = ['gender', 'admission_type_id', 'discharge_disposition_id',
'admission_source_id', 'diag_1', 'diag_2', 'diag_3',
'metformin', 'repaglinide', 'glimepiride', 'glipizide',
'glyburide', 'pioglitazone', 'rosiglitazone', 'insulin',
'change', 'diabetesMed']
#Encode each categorical column
for col in categorical_cols:
    dummies = pd.get_dummies(data[col], prefix=col, drop_first=True)
    data_encoded = pd.concat([data_encoded, dummies], axis=1)
#Add numerical columns as they are
numerical_cols = ['age', 'time_in_hospital', 'num_lab_procedures', 'num_procedures',
'num_medications', 'number_outpatient', 'number_emergency',
'number_inpatient', 'number_diagnoses']
data_encoded = pd.concat([data_encoded, data[numerical_cols]], axis=1)
data_encoded.head(3)
| readmitted | gender_Male | admission_type_id_Emergency | admission_type_id_New Born | admission_type_id_Trauma Center | discharge_disposition_id_18 | discharge_disposition_id_19 | discharge_disposition_id_23 | discharge_disposition_id_24 | discharge_disposition_id_27 | discharge_disposition_id_28 | discharge_disposition_id_Emergency | discharge_disposition_id_Other | discharge_disposition_id_Referral | admission_source_id_Other | admission_source_id_Referral | diag_1_Diabetes | diag_1_Digestive | diag_1_Genitourinary | diag_1_Injury | diag_1_Musculoskeletal | diag_1_Neoplasms | diag_1_Other | diag_1_Pregnancy | diag_1_Respiratory | diag_2_Diabetes | diag_2_Digestive | diag_2_Genitourinary | diag_2_Injury | diag_2_Musculoskeletal | diag_2_Neoplasms | diag_2_Other | diag_2_Pregnancy | diag_2_Respiratory | diag_3_Diabetes | diag_3_Digestive | diag_3_Genitourinary | diag_3_Injury | diag_3_Musculoskeletal | diag_3_Neoplasms | diag_3_Other | diag_3_Pregnancy | diag_3_Respiratory | metformin_No | metformin_Steady | metformin_Up | repaglinide_No | repaglinide_Steady | repaglinide_Up | glimepiride_No | glimepiride_Steady | glimepiride_Up | glipizide_No | glipizide_Steady | glipizide_Up | glyburide_No | glyburide_Steady | glyburide_Up | pioglitazone_No | pioglitazone_Steady | pioglitazone_Up | rosiglitazone_No | rosiglitazone_Steady | rosiglitazone_Up | insulin_No | insulin_Steady | insulin_Up | change_No | diabetesMed_Yes | age | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | True | False | False | True | False | False | True | False | False | True | False | False | True | False | False | True | False | False | True | False | False | False | False | True | False | True | 15 | 3 | 59 | 0 | 18 | 0 | 0 | 0 | 9 |
| 3 | 0 | True | True | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | True | False | False | True | False | False | True | False | False | True | False | False | True | False | False | False | False | True | False | True | 35 | 2 | 44 | 1 | 16 | 0 | 0 | 0 | 7 |
| 4 | 0 | True | True | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | True | False | False | True | False | False | False | True | False | True | False | False | True | False | False | True | False | False | False | True | False | False | True | 45 | 1 | 51 | 0 | 8 | 0 | 0 | 0 | 5 |
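A minimal illustration of what `drop_first=True` does above (toy values assumed): one redundant level per variable is removed, since that level is implied whenever all remaining dummies are 0.

```python
import pandas as pd

# Categories sort alphabetically ('No', 'Steady', 'Up'); drop_first
# removes the first ('No'), leaving two dummy columns.
s = pd.Series(['No', 'Up', 'Steady', 'No'], name='insulin')
d = pd.get_dummies(s, prefix='insulin', drop_first=True)
```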
# Composite 'health' utilization score: sum of standardized count features
cols_to_sum = ['number_emergency', 'number_inpatient', 'number_outpatient',
               'time_in_hospital', 'num_procedures', 'num_medications',
               'num_lab_procedures', 'number_diagnoses']
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data[cols_to_sum]),
columns=cols_to_sum,
index=data.index)
data['health'] = data_scaled.sum(axis=1)
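The 'health' column is therefore just a sum of z-scores. A toy sanity check (frame and values are illustrative): each standardized column has mean 0, so the composite score sums to roughly 0 across rows.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize two columns and sum the z-scores per row.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 30.0, 20.0]})
z = StandardScaler().fit_transform(toy)
score = z.sum(axis=1)
```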
# Note: the target 'readmitted' (and the derived 'health' score) remain in
# object_columns, so their p-values in the ranking below are trivially 0.
object_columns = data.columns.tolist()
X = data[object_columns]
y = data['readmitted']
X_str = X.applymap(str)
le = LabelEncoder()
X_encoded = X_str.apply(le.fit_transform)
y_encoded = le.fit_transform(y)
scores, pvalues = chi2(X_encoded, y_encoded)
for col, p in sorted(zip(object_columns, pvalues), key=lambda x: x[1], reverse=True):
    print(f"{col}: {p}")
insulin: 0.9845946231347301
glimepiride: 0.9220213529372799
diag_2: 0.8587245323491828
repaglinide: 0.7638825309383867
glyburide: 0.5710481526624618
pioglitazone: 0.42117989442777937
race: 0.376899054575053
rosiglitazone: 0.3071210114680176
metformin: 0.2927136593791332
glipizide: 0.2763222149381983
gender: 0.03302510488129809
admission_type_id: 1.2247881503442352e-06
discharge_disposition_id: 1.0402444969718035e-06
diag_3: 1.0404460655681793e-08
change: 5.368607337761889e-10
diabetesMed: 2.947011238818089e-13
diag_1: 4.411480148584445e-19
age: 1.4331081387194604e-25
num_procedures: 6.066056972462838e-26
admission_source_id: 1.3290849176207581e-40
number_diagnoses: 9.526661076607155e-42
time_in_hospital: 2.915168094672371e-69
num_lab_procedures: 6.483474719457476e-147
num_medications: 0.0
number_outpatient: 0.0
number_emergency: 0.0
number_inpatient: 0.0
readmitted: 0.0
health: 0.0
pvalue_dict = dict(zip(object_columns, pvalues))
selected_columns = [col for col, p in pvalue_dict.items() if p <= 0.3]
dataForClassification= data[selected_columns]
dataForClassification.head()
| gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_3 | number_diagnoses | metformin | glipizide | change | diabetesMed | readmitted | health | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Female | 15 | Emergency | Referral | Emergency | 3 | 59 | 0 | 18 | 0 | 0 | 0 | Other | Other | 9 | No | No | Ch | Yes | >30 | -0.148877 |
| 3 | Male | 35 | Emergency | Referral | Emergency | 2 | 44 | 1 | 16 | 0 | 0 | 0 | Other | Circulatory | 7 | No | No | Ch | Yes | NO | -1.994875 |
| 4 | Male | 45 | Emergency | Referral | Emergency | 1 | 51 | 0 | 8 | 0 | 0 | 0 | Neoplasms | Diabetes | 5 | No | Steady | Ch | Yes | NO | -4.544908 |
| 5 | Male | 55 | Emergency | Referral | Referral | 3 | 31 | 6 | 16 | 0 | 0 | 0 | Circulatory | Diabetes | 9 | No | No | No | Yes | >30 | 1.529434 |
| 7 | Male | 75 | Emergency | Referral | Emergency | 5 | 73 | 0 | 12 | 0 | 0 | 0 | Circulatory | Diabetes | 8 | No | No | No | Yes | >30 | -0.001356 |
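Why 'readmitted' shows p = 0.0 in the ranking above: the target column was left inside X, and a feature identical to the target always gets a vanishing chi2 p-value. A tiny check (synthetic labels, chosen for illustration):

```python
import numpy as np
from sklearn.feature_selection import chi2

# A "feature" that is literally the target: chi2 flags it as
# maximally dependent, hence a near-zero p-value.
y = np.array([0, 1] * 50)
X = y.reshape(-1, 1)
scores, pvals = chi2(X, y)
```

Such self-matches (and leakage-prone columns like 'health') should be excluded before interpreting the ranking.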
df = data[data["readmitted"] == "NO"]
df=df.drop("readmitted", axis=1)
df.head()
| race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | metformin | repaglinide | glimepiride | glipizide | glyburide | pioglitazone | rosiglitazone | insulin | change | diabetesMed | health | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Caucasian | Male | 35 | Emergency | Referral | Emergency | 2 | 44 | 1 | 16 | 0 | 0 | 0 | Other | Other | Circulatory | 7 | No | No | No | No | No | No | No | Up | Ch | Yes | -1.994875 |
| 4 | Caucasian | Male | 45 | Emergency | Referral | Emergency | 1 | 51 | 0 | 8 | 0 | 0 | 0 | Neoplasms | Neoplasms | Diabetes | 5 | No | No | No | Steady | No | No | No | Steady | Ch | Yes | -4.544908 |
| 8 | Caucasian | Female | 85 | Emergency | Referral | Other | 13 | 68 | 2 | 28 | 0 | 0 | 0 | Circulatory | Circulatory | Other | 8 | No | No | No | Steady | No | No | No | Steady | Ch | Yes | 5.460957 |
| 9 | Caucasian | Female | 95 | Elective | Referral | Other | 12 | 33 | 3 | 18 | 0 | 0 | 0 | Circulatory | Neoplasms | Respiratory | 8 | No | No | No | No | No | No | Steady | Steady | Ch | Yes | 2.712900 |
| 13 | Caucasian | Male | 85 | Emergency | Other | Emergency | 10 | 55 | 1 | 31 | 0 | 0 | 0 | Circulatory | Circulatory | Circulatory | 8 | No | No | No | No | No | No | No | Steady | No | Yes | 3.581884 |
# Note: the target 'time_in_hospital' is still in object_columns here,
# so its p-value against itself below is trivially 0.
object_columns = df.columns.tolist()
X = df[object_columns]
y = df['time_in_hospital']
X_str = X.applymap(str)
le = LabelEncoder()
X_encoded = X_str.apply(le.fit_transform)
y_encoded = le.fit_transform(y)
scores, pvalues = chi2(X_encoded, y_encoded)
for col, p in sorted(zip(object_columns, pvalues), key=lambda x: x[1], reverse=True):
    print(f"{col}: {p}")
glimepiride: 0.9999991220928548
repaglinide: 0.9999985917546521
pioglitazone: 0.9999933010688921
rosiglitazone: 0.9999493250644083
glipizide: 0.9967681880425915
glyburide: 0.9957001425055398
race: 0.6071671998384902
metformin: 0.4294585077261339
gender: 0.004721128561146807
number_emergency: 5.693645553959853e-05
admission_type_id: 8.27761135567137e-07
diabetesMed: 1.029605224401174e-09
insulin: 2.120330812734523e-12
discharge_disposition_id: 1.8957265135220723e-20
diag_1: 2.4802638151403367e-24
admission_source_id: 9.635565747763743e-29
number_inpatient: 1.728516010165934e-30
change: 3.4701419693626655e-40
number_outpatient: 3.775637771913903e-58
age: 2.0007922774861345e-79
number_diagnoses: 1.5770478540342118e-139
diag_2: 1.6857616786074464e-167
diag_3: 2.473449034724946e-232
time_in_hospital: 0.0
num_lab_procedures: 0.0
num_procedures: 0.0
num_medications: 0.0
health: 0.0
pvalue_dict = dict(zip(object_columns, pvalues))
selected_columns = [col for col, p in pvalue_dict.items() if p <= 0.3]
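The head already imports `SelectKBest`, and the manual p-value filter above could equivalently be expressed with it. Below is a minimal sketch on synthetic data (`X_demo` and `y_demo` are illustrative stand-ins, not the project's dataset):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the encoded feature matrix (chi2 needs non-negative values)
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 5, size=(100, 4))
y_demo = rng.integers(0, 2, size=100)

# Keep the k features with the best chi-squared scores
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X_demo, y_demo)
print(X_reduced.shape)  # (100, 2)
```

`SelectKBest` ranks by score rather than thresholding on p-values, so it is an alternative formulation, not a drop-in replacement for the `p <= 0.3` rule used here.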
dataForReg = data[selected_columns]
data['isReadmitted'] = data['readmitted'].apply(lambda x: x in ['<30', '>30'])
counts = data['isReadmitted'].value_counts()
count_true = data['isReadmitted'].sum()
count_false = len(data) - count_true
data_false = data[data['isReadmitted'] == False].sample(n=count_true, random_state=42)
data_true = data[data['isReadmitted'] == True]
data = pd.concat([data_true, data_false])
data = data.sample(frac=1, random_state=42)
counts = data['isReadmitted'].value_counts()
print("Distribution after balancing: ", counts)
Distribution after balancing:
isReadmitted
True     21813
False    21813
Name: count, dtype: int64
(xv) End of Preprocessing
We now have 28 columns (versus 50 before) and 63,002 rows (versus 101,766 before).
We have three dataframes: data, data_cleaned (with categorical labels), and data_encoded (encoded values only).
data.head()
| | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | metformin | repaglinide | glimepiride | glipizide | glyburide | pioglitazone | rosiglitazone | insulin | change | diabetesMed | readmitted | health | isReadmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 62983 | Caucasian | Male | 85 | Emergency | Referral | Emergency | 2 | 56 | 1 | 17 | 0 | 0 | 0 | Respiratory | Respiratory | Circulatory | 9 | No | No | No | No | No | No | No | No | No | Yes | >30 | -0.201808 | True |
| 10269 | Caucasian | Female | 45 | Emergency | Referral | Emergency | 5 | 68 | 0 | 13 | 0 | 0 | 3 | Digestive | Other | Other | 6 | No | No | No | No | No | No | No | Steady | No | Yes | >30 | 3.905023 | True |
| 35717 | Caucasian | Female | 85 | Emergency | Other | Emergency | 1 | 26 | 0 | 9 | 0 | 1 | 1 | Injury | Injury | Other | 9 | No | No | No | No | No | No | No | No | No | No | >30 | 0.009277 | True |
| 3992 | Caucasian | Male | 95 | Emergency | Other | Emergency | 3 | 36 | 0 | 9 | 0 | 0 | 0 | Other | Respiratory | Other | 5 | No | No | No | No | No | No | No | Steady | No | Yes | NO | -4.513091 | False |
| 22678 | AfricanAmerican | Male | 75 | Emergency | Other | Emergency | 10 | 66 | 1 | 27 | 0 | 0 | 1 | Circulatory | Circulatory | Respiratory | 6 | No | No | No | No | No | No | No | No | No | No | NO | 4.306637 | False |
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(3, 3, figsize=(20, 24))  # Adjust the size as needed
# Unified color palette
custom_palette = sns.color_palette("pastel", n_colors=10) # Adjust 'n_colors' as needed
# Plot 1: Age distribution
sns.countplot(x="age", data=data, palette=custom_palette, ax=axes[0, 0])
axes[0, 0].set_title('Age Distribution')
axes[0, 0].tick_params(axis='x', rotation=90)
# Plot 2: Race distribution
sns.countplot(x="race", data=data, palette=custom_palette, ax=axes[0, 1])
axes[0, 1].set_title('Race Distribution')
axes[0, 1].tick_params(axis='x', rotation=90)
# Plot 3: Gender distribution
sns.countplot(x="gender", data=data, palette=custom_palette, ax=axes[0, 2])
axes[0, 2].set_title('Gender Distribution')
axes[0, 2].tick_params(axis='x', rotation=90)
# Plot 4: Admission type distribution
sns.countplot(x="admission_type_id", data=data, palette=custom_palette, ax=axes[1, 0])
axes[1, 0].set_title('Admission Type Distribution')
axes[1, 0].tick_params(axis='x', rotation=90)
# Plot 5: Admission source distribution
sns.countplot(x="admission_source_id", data=data_cleaned, palette=custom_palette, ax=axes[1, 1])
axes[1, 1].set_title('Admission Source Distribution')
axes[1, 1].tick_params(axis='x', rotation=90)
# Plot 6: Discharge disposition distribution
sns.countplot(x="discharge_disposition_id", data=data_cleaned, palette=custom_palette, ax=axes[1, 2])
axes[1, 2].set_title('Discharge Disposition Distribution')
axes[1, 2].tick_params(axis='x', rotation=90)
# Plot 7: Time in hospital distribution
sns.countplot(x="time_in_hospital", data=data_cleaned, palette=custom_palette, ax=axes[2, 0])
axes[2, 0].set_title('Time in Hospital Distribution')
axes[2, 0].tick_params(axis='x', rotation=90)
# Plot 8: Readmission distribution
sns.countplot(x="readmitted", data=data_cleaned, palette=custom_palette, ax=axes[2, 1])
axes[2, 1].set_title('Readmission Distribution')
axes[2, 1].tick_params(axis='x', rotation=90)
# Hide the empty subplot
axes[2, 2].axis('off')
# Adjust layout
plt.tight_layout()
plt.show()
After conducting an analysis of the provided graphs, several key findings emerge. First, the data reveals a significant prevalence of Caucasian seniors within the patient population: the majority of individuals seeking medical care fall within the 55 to 85 age range.
In terms of gender distribution, no definitive conclusions can be drawn as the number of cases is nearly evenly split between males and females. Regarding hospital admissions, it is evident that a majority of patients are admitted due to emergency circumstances. Additionally, their discharge disposition typically occurs through a referral process. Notably, the data indicates a decreasing trend in the length of hospital stays, with the majority of patients remaining hospitalized for a duration of 1 to 5 days.
The final graph focuses on patient readmissions. The data demonstrates a relatively equal distribution between patients who are readmitted and those who are not. However, what varies is the number of days elapsed before readmission takes place. Overall, the parameters studied in this analysis do not reveal any distinct characteristics specific to diabetic patients. Nevertheless, a significant proportion of these individuals experience readmission to the hospital. Therefore, it would be valuable to explore potential correlations between these parameters and the readmission status of diabetic patients.
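One way to start exploring such correlations is to compute the readmission rate within each demographic group. A minimal sketch on a hypothetical miniature of the dataset (the `demo` frame below is illustrative only; the real dataframe is `data`):

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
demo = pd.DataFrame({
    "age": [45, 45, 65, 65, 85, 85],
    "readmitted": ["NO", ">30", "NO", "NO", "<30", ">30"],
})
# Same binarization rule as used in the preprocessing step
demo["isReadmitted"] = demo["readmitted"].isin(["<30", ">30"])
rate_by_age = demo.groupby("age")["isReadmitted"].mean()
print(rate_by_age)
```

Since `isReadmitted` is boolean, its group-wise mean is directly the readmission rate per age bracket.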
fig, axes = plt.subplots(4, 2, figsize=(14, 12))
# Plot for race
palette = ["#FFC0CB", "#6495ED","#FFFF00"]
sns.countplot(x="race", hue="readmitted", data=data, palette=palette, ax=axes[0, 0])
axes[0, 0].set_title('Readmission according to the race')
# Plot for age
sns.countplot(x="age", hue="readmitted", data=data, palette=palette, ax=axes[0, 1])
axes[0, 1].set_title('Readmission according to the age')
# Plot for gender
sns.countplot(x="gender", hue="readmitted", data=data, palette=palette, ax=axes[1, 0])
axes[1, 0].set_title('Readmission according to the gender')
# Plot for time spent in hospital
sns.countplot(x="time_in_hospital", hue="readmitted", data=data, palette=palette, ax=axes[1, 1])
axes[1, 1].set_title('Readmission according to the time spent in the hospital')
#Plot for admission type id
sns.countplot(x="admission_type_id", hue="readmitted", data=data, palette=palette, ax=axes[2, 0])
axes[2, 0].set_title('Readmission according to the admission type id')
#Plot for admission type source
sns.countplot(x="admission_source_id", hue="readmitted", data=data, palette=palette, ax=axes[2, 1])
axes[2, 1].set_title('Readmission according to the admission source id')
#Plot for discharge disposition id
sns.countplot(x="discharge_disposition_id", hue="readmitted", data=data, palette=palette, ax=axes[3, 0])
axes[3, 0].set_title('Readmission according to the discharge disposition id')
#Plot for number of diagnoses
sns.countplot(x="number_diagnoses", hue="readmitted", data=data, palette=palette, ax=axes[3, 1])
axes[3, 1].set_title('Readmission according to the number of diagnoses')
plt.tight_layout()
plt.show()
Due to the imbalanced distribution of classes in the dataset, with the "NO" class constituting approximately 50% of the data and the other classes having significantly fewer labels, a class imbalance problem arises. To mitigate this issue, we will reframe the problem as a two-class classification task. Specifically, our objective will be to predict whether a patient will be readmitted or not, without considering the distinction between readmissions within less than 30 days or greater than 30 days.
By simplifying the problem in this manner, we can effectively address the class imbalance and focus our predictive efforts on the primary task of determining readmission likelihood. This approach lets us allocate appropriate attention and resources to accurately predicting each patient's readmission status.
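Besides the undersampling applied earlier, another common way to handle imbalance is to reweight classes inside the classifier itself. A hedged sketch on a toy imbalanced problem (the data here is synthetic, generated with `make_classification`, not the hospital dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy problem standing in for the readmission labels (80/20 split)
X_demo, y_demo = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=42)

# class_weight='balanced' reweights classes inversely to their frequency,
# so the minority class contributes as much to the loss as the majority class
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_demo, y_demo)
print(clf.classes_)
```

Reweighting keeps all rows (no data is discarded), at the cost of a loss function that no longer reflects the raw class frequencies.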
#Readmission status according to the race,age, gender and time spent in hospital
fig, axes = plt.subplots(4, 2, figsize=(18, 16))
# Plot for race
palette = ["#FFC0CB", "#6495ED"]
sns.countplot(x="race", hue="readmitted", data=data, palette=palette, ax=axes[0, 0])
axes[0, 0].set_title('Readmission according to the race')
# Plot for age
sns.countplot(x="age", hue="readmitted", data=data, palette=palette, ax=axes[0, 1])
axes[0, 1].set_title('Readmission according to the age')
# Plot for gender
sns.countplot(x="gender", hue="readmitted", data=data, palette=palette, ax=axes[1, 0])
axes[1, 0].set_title('Readmission according to the gender')
# Plot for time spent in hospital
sns.countplot(x="time_in_hospital", hue="readmitted", data=data, palette=palette, ax=axes[1, 1])
axes[1, 1].set_title('Readmission according to the time spent in the hospital')
# Plot for admission type id
sns.countplot(x="admission_type_id", hue="readmitted", data=data, palette=palette, ax=axes[2, 0])
axes[2, 0].set_title('Readmission according to the admission type id')
# Plot for admission source id
sns.countplot(x="admission_source_id", hue="readmitted", data=data, palette=palette, ax=axes[2, 1])
axes[2, 1].set_title('Readmission according to the admission source id')
# Plot for discharge disposition id
sns.countplot(x="discharge_disposition_id", hue="readmitted", data=data, palette=palette, ax=axes[3, 0])
axes[3, 0].set_title('Readmission according to the discharge disposition id')
# Plot for number of diagnoses
sns.countplot(x="number_diagnoses", hue="readmitted", data=data, palette=palette, ax=axes[3, 1])
axes[3, 1].set_title('Readmission according to the number of diagnoses')
plt.tight_layout()
plt.show()
Upon analyzing the various graphs depicting the relationship between readmission and different parameters such as race, age, gender, and time spent in the hospital, a notable observation emerges. Overall, these graphs reveal a lack of discernible distinctions between individuals who were readmitted and those who were not. This indicates that the factors being examined do not exhibit a significant impact on the likelihood of readmission.
For instance, when transitioning from analyzing the influence of race to age, it becomes apparent that both readmitted and not readmitted individuals display similar trends. There is no evident divergence in the patterns followed by each status class, suggesting that age does not play a substantial role in readmission likelihood.
Similarly, when exploring the relationship between gender and readmission, the graphs demonstrate comparable trajectories for both status classes. This similarity implies that gender does not serve as a distinguishing factor in predicting readmission probability.
Furthermore, examining the impact of the time spent in the hospital on readmission reveals a lack of substantial differentiation between the two status classes. The trends observed for both readmitted and not readmitted individuals follow nearly identical paths, indicating that the duration of hospitalization does not significantly affect the likelihood of readmission.
Overall, these findings suggest that the parameters investigated, including race, age, gender, and time spent in the hospital, do not exhibit clear distinctions between individuals who are readmitted and those who are not. This indicates the need for further exploration and consideration of additional factors that may contribute to the prediction of readmission likelihood in order to gain a more comprehensive understanding of the underlying factors at play.
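The visual similarity described above can be quantified by comparing readmission rates across groups; near-equal rates confirm that a variable carries little signal on its own. A minimal sketch on toy data (the `demo` frame is illustrative; the real dataframe is `data`):

```python
import pandas as pd

# Illustrative toy data with one readmitted and one non-readmitted patient per gender
demo = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female"],
    "isReadmitted": [True, False, True, False],
})
rates = demo.groupby("gender")["isReadmitted"].mean()
print(rates)
```

Identical rates across groups, as in this toy example, mirror the "no discernible distinction" pattern seen in the plots.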
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# Plot configurations
plot_configs = [
{'x': 'num_lab_procedures', 'title': 'Relationship of Lab Procedures with Readmission'},
{'x': 'time_in_hospital', 'title': 'Relationship of Time in Hospital with Readmission'},
{'x': 'num_medications', 'title': 'Relationship of Medications with Readmission'},
{'x': 'num_procedures', 'title': 'Relationship of Procedures with Readmission'},
{'x': 'number_diagnoses', 'title': 'Relationship of Diagnoses with Readmission'}
]
for i, config in enumerate(plot_configs):
row, col = divmod(i, 3)
sns.kdeplot(data=data, x=config['x'], hue='readmitted', ax=axes[row, col])
axes[row, col].set_title(config['title'])
axes[1, 2].axis('off')
plt.tight_layout()
plt.show()
The three graphs depicting the correlations between lab procedures, hospital stay duration, and medication usage with the readmission status provide intriguing insights. Despite following a similar overall trend regardless of the readmission status, there are distinct patterns for each status category. Notably, it is evident that the "Not readmitted" status consistently exhibits higher values compared to the "Readmitted" status. Examining the relationship between lab procedures and readmission, we observe that both status categories generally increase and decrease simultaneously. However, patients who receive approximately 50 lab procedures are notably less likely to be readmitted. Similarly, a shorter initial hospital stay of 1 to 5 days is associated with a reduced likelihood of readmission. Furthermore, patients who are prescribed around 15 medications also show a decreased likelihood of readmission. Analyzing the number of procedures performed, we find that both status categories exhibit a similar trend, although the paths diverge more as the number of procedures increases. Notably, there appears to be a lower readmission rate when patients undergo 0 to 3 procedures. Additionally, exploring the correlation between the number of diagnoses and readmission status reveals minimal variance. However, a slight distinction can be observed when a patient receives more than 5 diagnoses, as they exhibit a lower likelihood of readmission. Overall, these findings highlight the potential influence of lab procedures, hospital stay duration, medication usage, and the number of diagnoses on readmission rates.
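The density shifts described above can also be summarized numerically, for instance by comparing group-wise medians of each variable by readmission status. A sketch on toy values (the `demo` frame is a stand-in for the plotted columns):

```python
import pandas as pd

# Toy stand-in for the plotted columns
demo = pd.DataFrame({
    "num_medications": [10, 15, 20, 12, 30, 25],
    "isReadmitted": [False, False, False, True, True, True],
})
medians = demo.groupby("isReadmitted")["num_medications"].median()
print(medians)
```

A gap between the two medians would corroborate what the KDE curves suggest visually.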
colors = ["#FFC0CB", "#FF69B4", "#FF1493"]
data["readmitted"].value_counts().plot.pie(autopct="%.1f%%", colors=colors)
plt.title("Readmission repartitions")
plt.show()
The pie chart shows the distribution of readmission statuses in the dataset. The readmission status is categorized into three groups:
NO: the patient was not readmitted.
>30: the patient was readmitted after more than 30 days.
<30: the patient was readmitted within 30 days.
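The percentages shown by the pie chart come from normalized value counts; a minimal sketch on toy labels using the same three categories:

```python
import pandas as pd

# Toy readmission labels using the same three categories as the dataset
readmitted = pd.Series(["NO", "NO", ">30", "<30"])
shares = readmitted.value_counts(normalize=True)
print(shares)
```

`normalize=True` returns proportions that sum to 1, which `plot.pie(autopct="%.1f%%")` then renders as percentages.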
diabetes_data=data
sns.set(style="whitegrid")
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Plot for 'age'
sns.histplot(diabetes_data['age'], kde=True, ax=axes[0])
axes[0].set_title('Distribution of Age')
# Plot for 'time_in_hospital'
sns.histplot(diabetes_data['time_in_hospital'], kde=True, ax=axes[1])
axes[1].set_title('Distribution of Time in Hospital')
# Plot for 'num_medications'
sns.histplot(diabetes_data['num_medications'], kde=True, ax=axes[2])
axes[2].set_title('Distribution of Number of Medications')
plt.tight_layout()
plt.show()
corr_matrix = data_encoded.corr()
plt.figure(figsize=(15, 12))
sns.heatmap(corr_matrix, cmap='coolwarm')
plt.title('Heatmap of Encoded Data Correlations')
plt.xticks(rotation=45)
plt.yticks(rotation=45)
plt.show()
NumMedicationsAverage = data_cleaned.groupby("race")["num_medications"].mean()
raceGender = data_cleaned.groupby("race")["race"].count()
TimeInHospital = data_cleaned.groupby("race")["time_in_hospital"].mean()
NumMedicationsAverage = pd.DataFrame({
"Race" : list(NumMedicationsAverage.index),
"NumMedications" : list(NumMedicationsAverage),
"raceGender" : list(raceGender),
"TimeInHospital" : list(TimeInHospital)})
import plotly.express as px
fig = px.scatter(NumMedicationsAverage, x='NumMedications', y='TimeInHospital', color='raceGender', size='raceGender',
height=500, text='Race', log_x=True, log_y=True,
color_discrete_sequence=px.colors.qualitative.Vivid,
template='plotly_dark')
fig.update_traces(marker_size=30, marker_line_width=1, opacity=0.7)
fig.update_layout(plot_bgcolor = 'rgb(50,50,50)',
paper_bgcolor = 'rgb(50,50,50)',
font=dict(family="Arial, sans-serif",
size=15,
color="#FFFFFF"))
fig.show()
fig = px.scatter(data, x="time_in_hospital", y="age", size="time_in_hospital",
                 color="age", hover_name="time_in_hospital", size_max=30,
                 title='Age Distribution by time_in_hospital',
                 template='plotly_dark', width=500, height=400)
fig.update_layout(plot_bgcolor = 'rgb(50,50,50)',
paper_bgcolor = 'rgb(50,50,50)',
font=dict(family="Arial, sans-serif",
size=15,
color="#FFFFFF"))
fig.show()
race_top = data['race'].value_counts()  # value_counts() is already sorted in descending order
print("The top races in the dataset are:\n")
print("{:2s} {:20s} {:s}".format("#", "Race", "Count"))
for i, ranking in enumerate(race_top.index):
print("{:2d} {:20s} {:d}".format(i+1, ranking, int(race_top[ranking])))
from matplotlib import rcParams
rcParams['figure.facecolor'] = '#343131'
plt.style.use('dark_background')
#creating a 3D bar plot
fig1 = plt.figure(figsize=(15,15))
ax = plt.axes(projection='3d')
ax.bar3d(range(len(race_top)), range(len(race_top)), np.zeros(len(race_top)), 1, 1, race_top.values, shade=True, color=sns.color_palette("RdBu", n_colors=len(race_top.values)))
ax.set_xticks(range(len(race_top)))
ax.set_xticklabels(race_top.index)
ax.set_title("Top Races in the Dataset")
ax.set_xlabel("Race")
ax.set_ylabel("Ranking")
ax.set_zlabel("Count")
plt.show()
The top races in the dataset are:

 #  Race                 Count
 1  Caucasian            33164
 2  AfricanAmerican       8575
 3  Hispanic               860
 4  Other                  720
 5  Asian                  307
readmitted_data = data[data['isReadmitted'] == True]
pivot_table = pd.pivot_table(data, values='isReadmitted', index=['age'], columns=['race'], aggfunc='count', fill_value=0)
import plotly.express as px
def plot_treemap(data):
if 'time_in_hospital' in data.columns and 'race' in data.columns:
data_for_treemap = data.groupby(["time_in_hospital", "race"]).size().reset_index(name='counts')
fig = px.treemap(data_for_treemap, path=["time_in_hospital", "race"], values='counts',
title="Number of times spent by each race",
color_discrete_sequence=px.colors.qualitative.Dark2)
fig.data[0].textinfo = 'label+text'
fig.show()
else:
print("Required columns are not in the DataFrame")
plot_treemap(data)
def evaluate_model(X, y, classifier='RF', random_state=42):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
if classifier == 'RF':
model = RandomForestClassifier(n_estimators=100, random_state=random_state)
elif classifier == 'LR':
model = LogisticRegression(random_state=random_state)
elif classifier == 'DT':
model = DecisionTreeClassifier(random_state=random_state)
else:
raise ValueError("Unsupported classifier. Choose 'RF' for RandomForest, 'LR' for LogisticRegression, or 'DT' for DecisionTree.")
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', model)
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f"Model: {classifier}")
print(f"Accuracy: {accuracy}")
print(f"Classification Report:\n{report}")
#Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=y_test.unique(), yticklabels=y_test.unique())
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title(f'Confusion Matrix for {classifier}')
plt.show()
return y_pred, accuracy, report
X = dataForClassification.drop('readmitted', axis=1)
X = pd.get_dummies(X)
y = data_encoded['readmitted']
RF_y_pred, RF_accuracy, RF_report = evaluate_model(X, y, classifier='RF')
Model: RF
Accuracy: 0.6226757369614513
Classification Report:
precision recall f1-score support
0 0.65 0.82 0.73 6742
1 0.52 0.31 0.39 4283
accuracy 0.62 11025
macro avg 0.59 0.57 0.56 11025
weighted avg 0.60 0.62 0.60 11025
LR_y_pred, LR_accuracy, LR_report = evaluate_model(X, y, classifier='LR')
Model: LR
Accuracy: 0.6346485260770975
Classification Report:
precision recall f1-score support
0 0.64 0.91 0.75 6742
1 0.59 0.20 0.30 4283
accuracy 0.63 11025
macro avg 0.61 0.56 0.53 11025
weighted avg 0.62 0.63 0.58 11025
DT_y_pred, DT_accuracy, DT_report = evaluate_model(X, y, classifier='DT')
Model: DT
Accuracy: 0.5438548752834467
Classification Report:
precision recall f1-score support
0 0.63 0.61 0.62 6742
1 0.42 0.44 0.43 4283
accuracy 0.54 11025
macro avg 0.52 0.52 0.52 11025
weighted avg 0.55 0.54 0.55 11025
accuracies = {'LR': LR_accuracy, 'RF': RF_accuracy, 'DT': DT_accuracy}
plt.bar(accuracies.keys(), accuracies.values())
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('Comparison of Classifier Accuracies')
plt.show()
Our first results are promising for both the Logistic Regression and Random Forest classifiers, but the Decision Tree seems to need further tuning before it is fully operational.
def evaluate_model(X, y, feature_names, classifier='RF', random_state=42):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
if classifier == 'RF':
model = RandomForestClassifier(n_estimators=100, random_state=random_state)
elif classifier == 'LR':
model = LogisticRegression(random_state=random_state)
elif classifier == 'DT':
model = DecisionTreeClassifier(random_state=random_state)
else:
raise ValueError("Unsupported classifier. Choose 'RF' for RandomForest, 'LR' for LogisticRegression, or 'DT' for DecisionTree.")
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', model)
])
pipeline.fit(X_train, y_train)
if classifier in ['RF', 'DT']:
feature_importances = pipeline.named_steps['classifier'].feature_importances_
#Combines feature names and their importances, then sort them
feature_importance_pairs = sorted(zip(feature_names, feature_importances), key=lambda x: x[1], reverse=True)
print(f"Model: {classifier}")
print("Feature Importances (Descending):")
for name, importance in feature_importance_pairs:
print(f"{name}: {importance}")
if classifier == 'LR':
#Retrieves the model's coefficients
coefficients = pipeline.named_steps['classifier'].coef_[0]
#Combine feature names and their coefficients, then sort them
feature_coefficient_pairs = sorted(zip(feature_names, coefficients), key=lambda x: x[1], reverse=True)
print("Feature Coefficients (Descending):")
for name, coef in feature_coefficient_pairs:
print(f"{name}: {coef}")
evaluate_model(X, y, X.columns, classifier='RF')
Model: RF
Feature Importances (Descending):
health: 0.13944856242343415
num_lab_procedures: 0.12228688866277172
num_medications: 0.10581542615079932
time_in_hospital: 0.07607638756092942
age: 0.06476983200638976
num_procedures: 0.050125380473973837
number_diagnoses: 0.048963308300453194
number_inpatient: 0.01900303140089058
diag_3_Other: 0.0175999340275441
number_outpatient: 0.017497994117030475
diag_3_Circulatory: 0.016812376810658645
gender_Female: 0.015891160969829156
gender_Male: 0.015887524156369803
diag_1_Circulatory: 0.015615727735958186
diag_1_Other: 0.014790355277338693
discharge_disposition_id_Referral: 0.013476770403995534
diag_1_Respiratory: 0.012344556864803631
discharge_disposition_id_Other: 0.012017967646439011
number_emergency: 0.011275603160365102
diag_3_Diabetes: 0.010881445191778277
metformin_No: 0.010609021098365142
metformin_Steady: 0.010418074603075502
diag_1_Digestive: 0.01039304990052914
change_No: 0.01023476906144564
change_Ch: 0.010176811371080795
diag_3_Respiratory: 0.009683419826684869
glipizide_No: 0.009610437409541383
admission_source_id_Emergency: 0.009467386314602906
glipizide_Steady: 0.009075620946668002
diag_3_Genitourinary: 0.008733477646221935
admission_source_id_Referral: 0.008732420472440994
diag_1_Genitourinary: 0.007982419535780922
diag_1_Injury: 0.007741673444785552
diag_3_Digestive: 0.006875896594587546
admission_type_id_Emergency: 0.006184451544552506
admission_type_id_Elective: 0.0061195951930300705
diag_1_Musculoskeletal: 0.0061130051725721835
admission_source_id_Other: 0.005654659723370894
diag_1_Neoplasms: 0.005591065020083724
diabetesMed_No: 0.005311876083012215
discharge_disposition_id_18: 0.00527382110797224
diabetesMed_Yes: 0.004746549282060857
diag_3_Injury: 0.004214963502264784
diag_3_Musculoskeletal: 0.004036219778009884
diag_3_Neoplasms: 0.003992869458450163
metformin_Up: 0.0022489258331206546
glipizide_Up: 0.00174328322228958
metformin_Down: 0.0014938681583061594
glipizide_Down: 0.0014361661574670032
discharge_disposition_id_Emergency: 0.001361500412408972
diag_1_Pregnancy: 0.001269779080681933
discharge_disposition_id_23: 0.0010246763217487066
diag_3_Pregnancy: 0.0006030024021905921
diag_1_Diabetes: 0.0005388805403514824
discharge_disposition_id_28: 0.000454583668745858
discharge_disposition_id_24: 0.00014173728398512054
admission_type_id_Trauma Center: 4.806354748208742e-05
admission_type_id_New Born: 4.2629402584261597e-05
discharge_disposition_id_12: 2.046089439525105e-05
discharge_disposition_id_19: 9.664268961823985e-06
discharge_disposition_id_27: 8.99137233821077e-06
evaluate_model(X, y, X.columns, classifier='LR')
Feature Coefficients (Descending):
number_inpatient: 0.21483572528857936
number_emergency: 0.11624915456463253
health: 0.10624136369348006
number_diagnoses: 0.09995854373120733
number_outpatient: 0.07623847524146755
time_in_hospital: 0.0756953119070123
age: 0.0650001337356093
diag_1_Circulatory: 0.05569944443215995
diabetesMed_Yes: 0.05475255536523958
admission_source_id_Emergency: 0.04014562089172916
discharge_disposition_id_28: 0.03618420653962188
admission_source_id_Referral: 0.03313911504147939
glipizide_Down: 0.026109078514487485
admission_type_id_Emergency: 0.02554799568465501
metformin_No: 0.0248120518516215
discharge_disposition_id_Referral: 0.023224081558972456
diag_3_Circulatory: 0.019823182469224394
gender_Female: 0.014651993539839554
diag_1_Other: 0.01260389392390893
glipizide_Steady: 0.010794480359940064
change_Ch: 0.010693403115696795
diag_3_Digestive: 0.00971100295859231
diag_3_Other: 0.0065007211765714
diag_3_Genitourinary: 0.0064842793936444355
diag_1_Respiratory: 0.005047609642537294
diag_3_Respiratory: 0.0037210779176871534
discharge_disposition_id_12: 0.0031687868648649347
admission_type_id_New Born: 0.002033035728410787
diag_3_Pregnancy: 0.0017442212561520644
discharge_disposition_id_Emergency: 0.001161316349126094
metformin_Down: -0.0005273443441462288
diag_3_Musculoskeletal: -0.0034489881597681263
diag_1_Musculoskeletal: -0.005818550921573946
diag_1_Digestive: -0.00861318581470544
glipizide_Up: -0.008983378472124583
change_No: -0.010693403115693592
discharge_disposition_id_Other: -0.011470646342594669
glipizide_No: -0.013665325424637952
discharge_disposition_id_18: -0.013810725093261322
gender_Male: -0.014651993539840112
diag_1_Injury: -0.017785114944291802
metformin_Up: -0.019213894245900864
admission_type_id_Elective: -0.019477319079029262
metformin_Steady: -0.02030208673628146
diag_1_Genitourinary: -0.021486380330947925
diag_3_Neoplasms: -0.02213709118333147
diag_3_Injury: -0.024651415694954746
diag_3_Diabetes: -0.028719926932409393
discharge_disposition_id_24: -0.02874360803376186
diag_1_Diabetes: -0.030959807752258678
num_medications: -0.032462888224795206
discharge_disposition_id_23: -0.03814133286328189
diabetesMed_No: -0.05475255536523958
discharge_disposition_id_27: -0.05682576758164103
diag_1_Neoplasms: -0.06396493176913116
discharge_disposition_id_19: -0.06643407995714638
diag_1_Pregnancy: -0.06652209842799275
num_lab_procedures: -0.07367467980460696
num_procedures: -0.07388248029631934
admission_source_id_Other: -0.13792650889343377
admission_type_id_Trauma Center: -0.141294739279956
evaluate_model(X, y, X.columns, classifier='DT')
Model: DT
Feature Importances (Descending):
health: 0.19467809029367722
num_lab_procedures: 0.14356510764026692
num_medications: 0.10757291867109377
time_in_hospital: 0.07109384352313515
age: 0.051177185659353405
num_procedures: 0.03711096598595999
number_diagnoses: 0.0361110500390211
diag_3_Circulatory: 0.019394766284820016
number_inpatient: 0.018940487656694815
diag_3_Other: 0.01594629938262362
diag_1_Circulatory: 0.015016276662505745
gender_Female: 0.01410552565672047
diag_1_Other: 0.013513260418592968
number_outpatient: 0.012895451789864495
gender_Male: 0.011282824196652353
diag_3_Genitourinary: 0.010712851402882029
diag_3_Respiratory: 0.010677553312502667
discharge_disposition_id_Referral: 0.010601807167048175
diag_1_Digestive: 0.010392105961269457
discharge_disposition_id_Other: 0.010124578247870207
diag_3_Diabetes: 0.009716379959136805
change_No: 0.009533271775835411
diag_1_Respiratory: 0.009522835380924305
glipizide_Steady: 0.009319130748747248
metformin_No: 0.008588199335121413
change_Ch: 0.008467884488902804
diag_1_Genitourinary: 0.008317829155426674
metformin_Steady: 0.008252965082819506
diag_1_Injury: 0.008215022420421984
admission_source_id_Emergency: 0.008192014668104063
admission_source_id_Referral: 0.008125429016038393
diag_3_Digestive: 0.007813974960041029
admission_type_id_Emergency: 0.007570498039228126
number_emergency: 0.007459523154536431
glipizide_No: 0.007067071643168371
diag_1_Neoplasms: 0.006146822925012325
diabetesMed_No: 0.0059760477097499706
discharge_disposition_id_18: 0.005383320503903086
admission_source_id_Other: 0.005333626248607413
diag_1_Musculoskeletal: 0.005194542982799461
admission_type_id_Elective: 0.005005373834801787
diag_3_Musculoskeletal: 0.004432912609087211
diag_3_Neoplasms: 0.003392786963379391
diag_3_Injury: 0.0031099358183306054
diabetesMed_Yes: 0.0029298760546444436
metformin_Down: 0.002443328880450783
metformin_Up: 0.001989402068666195
diag_1_Pregnancy: 0.0014134405889158308
glipizide_Down: 0.0013646007605203534
discharge_disposition_id_Emergency: 0.0012711365707662628
glipizide_Up: 0.0012430302760499156
discharge_disposition_id_23: 0.0007838759410484252
diag_3_Pregnancy: 0.0006233242716186621
diag_1_Diabetes: 0.00046025946997530864
discharge_disposition_id_28: 0.00025393882123982055
discharge_disposition_id_24: 0.00010771098715241646
admission_type_id_Trauma Center: 6.57259322732852e-05
admission_type_id_New Born: 0.0
discharge_disposition_id_12: 0.0
discharge_disposition_id_19: 0.0
discharge_disposition_id_27: 0.0
We can now get a better grasp of the core components of our analyses: our engineered 'health' variable has one of the highest importance scores.
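The long importance listings above are easier to compare when only the top-ranked features are kept. A sketch re-entering a few of the reported RF importances by hand (the values are copied from the output above, truncated for brevity):

```python
import pandas as pd

# A few of the RF importances reported above, re-entered by hand for illustration
importances = pd.Series({
    "health": 0.1394,
    "num_lab_procedures": 0.1223,
    "num_medications": 0.1058,
    "time_in_hospital": 0.0761,
})
top3 = importances.sort_values(ascending=False).head(3)
print(top3.index.tolist())
```

The same `sort_values().head(n)` pattern works directly on `pipeline.named_steps['classifier'].feature_importances_` wrapped in a Series indexed by `X.columns`.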
def evaluate_model_with_gridsearch(X, y, classifier='RF', random_state=42):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
if classifier == 'RF':
model = RandomForestClassifier(random_state=random_state)
param_grid = {
'classifier__n_estimators': [700], #Values tested: 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000
            'classifier__max_depth': [10, 20] #Values tested: 1, 5, 10, 15, 20, 40, 50, 100
}
elif classifier == 'LR':
model = LogisticRegression(random_state=random_state)
param_grid = {
            'classifier__C': [0.1, 5, 10, 100],
            'classifier__penalty': ['l2'], #'none' is not supported by liblinear, so only valid combinations are kept
            'classifier__solver': ['lbfgs', 'liblinear']
}
elif classifier == 'DT':
model = DecisionTreeClassifier(random_state=random_state)
param_grid = {
'classifier__max_depth': [10, 20, 30, None],
'classifier__min_samples_split': [2, 5, 10],
'classifier__min_samples_leaf': [1, 2, 4],
'classifier__criterion': ['gini', 'entropy']
}
else:
raise ValueError("Unsupported classifier. Choose 'RF' for RandomForest, 'LR' for LogisticRegression or 'DT' for DecisionTree Classifier.")
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', model)
])
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_}")
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f"Model: {classifier}")
print(f"Accuracy on Test Set: {accuracy}")
print(f"Classification Report on Test Set:\n{report}")
evaluate_model_with_gridsearch(X, y, classifier='RF')  # best found: max_depth=20, n_estimators=700
Best Parameters: {'classifier__max_depth': 20, 'classifier__n_estimators': 700}
Best CV Score: 0.6270748299319727
Model: RF
Accuracy on Test Set: 0.6351927437641723
Classification Report on Test Set:
precision recall f1-score support
0 0.65 0.88 0.75 6742
1 0.57 0.25 0.35 4283
accuracy 0.64 11025
macro avg 0.61 0.56 0.55 11025
weighted avg 0.62 0.64 0.59 11025
evaluate_model_with_gridsearch(X, y, classifier='LR')
Best Parameters: {'classifier__C': 100, 'classifier__penalty': 'l2', 'classifier__solver': 'lbfgs'}
Best CV Score: 0.6244671201814059
Model: LR
Accuracy on Test Set: 0.6346485260770975
Classification Report on Test Set:
precision recall f1-score support
0 0.64 0.91 0.75 6742
1 0.59 0.20 0.30 4283
accuracy 0.63 11025
macro avg 0.61 0.56 0.53 11025
weighted avg 0.62 0.63 0.58 11025
evaluate_model_with_gridsearch(X, y, classifier='DT')
Best Parameters: {'classifier__criterion': 'entropy', 'classifier__max_depth': 10, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 5}
Best CV Score: 0.617891156462585
Model: DT
Accuracy on Test Set: 0.6234920634920635
Classification Report on Test Set:
precision recall f1-score support
0 0.65 0.85 0.73 6742
1 0.53 0.26 0.35 4283
accuracy 0.62 11025
macro avg 0.59 0.56 0.54 11025
weighted avg 0.60 0.62 0.59 11025
We can now fix each model's best parameters and re-evaluate.
import matplotlib.pyplot as plt
import seaborn as sns

def evaluate_model(X, y, feature_names, classifier='RF', random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
    if classifier == 'RF':
        model = RandomForestClassifier(max_depth=20, n_estimators=700, random_state=random_state)
    elif classifier == 'LR':
        model = LogisticRegression(penalty='l2', solver='lbfgs', random_state=random_state)
    elif classifier == 'DT':
        model = DecisionTreeClassifier(criterion='entropy', max_depth=10,
                                       min_samples_leaf=2, min_samples_split=5,
                                       random_state=random_state)
    else:
        raise ValueError("Unsupported classifier. Choose 'RF' for RandomForest, 'LR' for LogisticRegression or 'DT' for DecisionTree Classifier.")
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', model)
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    print(f"Model: {classifier}")
    print(f"Accuracy: {accuracy}")
    print(f"Classification Report:\n{report}")
    # Plotting Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=y_test.unique(), yticklabels=y_test.unique())
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title(f'Confusion Matrix for {classifier}')
    plt.show()
    return y_pred, accuracy, report
X = dataForClassification.drop('readmitted', axis=1)
X = pd.get_dummies(X)
y = data_encoded['readmitted']
RF_y_pred, RF_accuracy, RF_report = evaluate_model(X, y, X.columns, classifier='RF')
Model: RF
Accuracy: 0.6351927437641723
Classification Report:
precision recall f1-score support
0 0.65 0.88 0.75 6742
1 0.57 0.25 0.35 4283
accuracy 0.64 11025
macro avg 0.61 0.56 0.55 11025
weighted avg 0.62 0.64 0.59 11025
LR_y_pred, LR_accuracy, LR_report = evaluate_model(X, y, X.columns, classifier='LR')
Model: LR
Accuracy: 0.6346485260770975
Classification Report:
precision recall f1-score support
0 0.64 0.91 0.75 6742
1 0.59 0.20 0.30 4283
accuracy 0.63 11025
macro avg 0.61 0.56 0.53 11025
weighted avg 0.62 0.63 0.58 11025
DT_y_pred, DT_accuracy, DT_report = evaluate_model(X, y, X.columns, classifier='DT')
Model: DT
Accuracy: 0.6236734693877551
Classification Report:
precision recall f1-score support
0 0.65 0.85 0.73 6742
1 0.53 0.26 0.35 4283
accuracy 0.62 11025
macro avg 0.59 0.56 0.54 11025
weighted avg 0.60 0.62 0.59 11025
accuracies = {'LR': LR_accuracy, 'RF': RF_accuracy, 'DT': DT_accuracy}
plt.bar(accuracies.keys(), accuracies.values())
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('Comparison of Classifier Accuracies')
plt.show()
Thanks to grid search, we improved the performance of all three models. The three accuracies converge around 0.64, which suggests the dataset itself limits how much further we can go.
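A useful reference point is the majority-class baseline: with 6742 of 11025 test samples never readmitted (class 0), always predicting 0 already scores about 0.61, so the models only add a few points on top. A quick check:

```python
# Support counts taken from the test-set classification reports above
n_class0, n_class1 = 6742, 4283
baseline = n_class0 / (n_class0 + n_class1)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```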
dataForClassification.columns
Index(['gender', 'age', 'admission_type_id', 'discharge_disposition_id',
'admission_source_id', 'time_in_hospital', 'num_lab_procedures',
'num_procedures', 'num_medications', 'number_outpatient',
'number_emergency', 'number_inpatient', 'diag_1', 'diag_3',
'number_diagnoses', 'metformin', 'glipizide', 'change', 'diabetesMed',
'readmitted', 'health'],
dtype='object')
columns_to_drop = [
    'admission_source_id',
    'diag_3', 'diag_1', 'diabetesMed',
    'metformin', 'glipizide',
    'discharge_disposition_id', 'admission_type_id'
]
X = dataForClassification.drop(['readmitted'] + columns_to_drop, axis=1)
X = pd.get_dummies(X)
y = data_encoded['readmitted']
evaluate_model(X, y, X.columns, classifier='RF')
Model: RF
Accuracy: 0.617233560090703
Classification Report:
precision recall f1-score support
0 0.65 0.83 0.73 6742
1 0.51 0.28 0.37 4283
accuracy 0.62 11025
macro avg 0.58 0.56 0.55 11025
weighted avg 0.59 0.62 0.59 11025
evaluate_model(X, y, X.columns, classifier='DT')
Model: DT
Accuracy: 0.6222222222222222
Classification Report:
precision recall f1-score support
0 0.64 0.86 0.74 6742
1 0.53 0.25 0.34 4283
accuracy 0.62 11025
macro avg 0.59 0.55 0.54 11025
weighted avg 0.60 0.62 0.58 11025
evaluate_model(X, y, X.columns, classifier='LR')
Model: LR
Accuracy: 0.6296598639455783
Classification Report:
precision recall f1-score support
0 0.63 0.93 0.75 6742
1 0.59 0.16 0.25 4283
accuracy 0.63 11025
macro avg 0.61 0.54 0.50 11025
weighted avg 0.62 0.63 0.56 11025
Removing low-influence columns only had a small impact on the models. This suggests that, if we needed to scale up and save computing power, we could simply drop them; at a smaller scale like this one, we keep them to get the best possible predictions.
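Instead of dropping columns by hand, selection can be automated with `SelectKBest` (imported at the top of the notebook but unused so far). A minimal sketch on synthetic non-negative data, since the chi2 score requires non-negative features; `X_demo`/`y_demo` are stand-ins, not the notebook's matrices:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
X_demo = rng.integers(0, 10, size=(200, 6))  # 6 non-negative integer features
y_demo = (X_demo[:, 0] > 5).astype(int)      # target driven only by feature 0

selector = SelectKBest(chi2, k=3).fit(X_demo, y_demo)
print("Selected feature indices:", selector.get_support(indices=True))
```

Feature 0 should be among those kept, since it alone determines the target; on the real data, `k` would be tuned against held-out accuracy.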
dataForReg.head()
|   | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | insulin | change | diabetesMed | health |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Female | 15 | Emergency | Referral | Emergency | 3 | 59 | 0 | 18 | 0 | 0 | 0 | Other | Other | Other | 9 | Up | Ch | Yes | -0.148877 |
| 3 | Male | 35 | Emergency | Referral | Emergency | 2 | 44 | 1 | 16 | 0 | 0 | 0 | Other | Other | Circulatory | 7 | Up | Ch | Yes | -1.994875 |
| 4 | Male | 45 | Emergency | Referral | Emergency | 1 | 51 | 0 | 8 | 0 | 0 | 0 | Neoplasms | Neoplasms | Diabetes | 5 | Steady | Ch | Yes | -4.544908 |
| 5 | Male | 55 | Emergency | Referral | Referral | 3 | 31 | 6 | 16 | 0 | 0 | 0 | Circulatory | Circulatory | Diabetes | 9 | Steady | No | Yes | 1.529434 |
| 7 | Male | 75 | Emergency | Referral | Emergency | 5 | 73 | 0 | 12 | 0 | 0 | 0 | Circulatory | Respiratory | Diabetes | 8 | No | No | Yes | -0.001356 |
dataForReg.shape
(55125, 20)
X = dataForReg.drop("time_in_hospital", axis=1)
X = X.drop("health",axis = 1)
X = pd.get_dummies(X)
y = dataForReg["time_in_hospital"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Our approach was straightforward: we excluded the 'health' parameter, since it is an aggregate whose components include the number of days spent in hospital; leaving it in would mean the answer to our question was inherent in the question itself. We also converted categorical variables into binary dummy variables, ensuring compatibility with scikit-learn.
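The dummy-encoding step can be seen on a toy frame: `pd.get_dummies` keeps numeric columns as-is and expands each categorical column into one indicator column per level (the column names here are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"change": ["Ch", "No", "Ch"], "age": [15, 35, 45]})
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())  # numeric 'age' kept; 'change' becomes change_Ch / change_No
```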
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

models = [
    ("Linear Regression", LinearRegression()),
    ("Ridge Regression", Ridge()),
    ("Lasso Regression", Lasso()),
    ("Decision Tree", DecisionTreeRegressor(random_state=42)),
    ("Random Forest", RandomForestRegressor(random_state=42)),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=42)),
]
model_performance = {}
feature_importance_dict = {}
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Compute RMSE
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    model_performance[name] = rmse
    # How to get the most important variables depends on the model
    if name in ["Linear Regression", "Ridge Regression", "Lasso Regression"]:
        importance = model.coef_
    else:
        importance = model.feature_importances_
    feature_importance = pd.DataFrame(importance, index=X_train.columns, columns=["Importance"])
    feature_importance = feature_importance.sort_values("Importance", ascending=False)
    feature_importance_dict[name] = feature_importance
# Print model performances
model_performance_df = pd.DataFrame(list(model_performance.items()), columns=['Model', 'RMSE'])
print(model_performance_df)
for name, importance in feature_importance_dict.items():
    print(f"Feature importances for {name}:")
    print(importance.head(10))
    print("\n")
Model RMSE
0 Linear Regression 2.425287
1 Ridge Regression 2.424914
2 Lasso Regression 2.528947
3 Decision Tree 3.326711
4 Random Forest 2.359572
5 Gradient Boosting 2.339490
Feature importances for Linear Regression:
Importance
discharge_disposition_id_27 3.544452
discharge_disposition_id_23 1.884468
discharge_disposition_id_18 0.788320
diag_1_Neoplasms 0.754714
admission_type_id_Trauma Center 0.665000
discharge_disposition_id_Other 0.620263
diag_1_Pregnancy 0.491586
diag_1_Other 0.370570
admission_source_id_Other 0.336717
admission_type_id_Emergency 0.275796
Feature importances for Ridge Regression:
Importance
discharge_disposition_id_27 2.344777
discharge_disposition_id_23 1.848584
discharge_disposition_id_18 0.762514
diag_1_Neoplasms 0.754616
admission_type_id_Trauma Center 0.611150
discharge_disposition_id_Other 0.594723
diag_1_Pregnancy 0.485953
diag_1_Other 0.370963
admission_source_id_Other 0.336610
admission_type_id_Emergency 0.265064
Feature importances for Lasso Regression:
Importance
num_medications 0.129605
num_lab_procedures 0.034375
age 0.014310
diabetesMed_No 0.000000
change_No -0.000000
diag_1_Pregnancy 0.000000
diag_1_Respiratory -0.000000
diag_2_Circulatory -0.000000
diag_2_Diabetes -0.000000
diag_2_Digestive 0.000000
Feature importances for Decision Tree:
Importance
num_medications 0.265303
num_lab_procedures 0.176683
age 0.052088
num_procedures 0.051753
number_diagnoses 0.040806
number_inpatient 0.016910
discharge_disposition_id_Referral 0.015537
number_outpatient 0.015141
diag_3_Other 0.014773
diag_2_Other 0.013906
Feature importances for Random Forest:
Importance
num_medications 0.271208
num_lab_procedures 0.174330
age 0.052580
num_procedures 0.050526
number_diagnoses 0.040372
number_inpatient 0.017384
discharge_disposition_id_Referral 0.016276
diag_1_Other 0.014671
number_outpatient 0.013804
diag_3_Other 0.013477
Feature importances for Gradient Boosting:
Importance
num_medications 0.552992
num_lab_procedures 0.206317
num_procedures 0.040134
discharge_disposition_id_Referral 0.033622
age 0.025623
number_diagnoses 0.019056
admission_type_id_Elective 0.018685
diag_1_Other 0.013985
diag_1_Neoplasms 0.010191
discharge_disposition_id_Other 0.008476
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
sampled_data = dataForReg.sample(500, random_state=42)
fig = plt.figure(figsize=(15, 10))
# Four viewing angles of the same scatter: medications vs lab procedures vs stay length
views = [(20, 30), (20, -60), (5, 90), (30, 150)]
for i, (elev, azim) in enumerate(views, start=1):
    ax = fig.add_subplot(2, 2, i, projection='3d')
    ax.scatter(sampled_data['num_medications'], sampled_data['num_lab_procedures'],
               sampled_data['time_in_hospital'], alpha=0.5)
    ax.set_title(f'Perspective {i}')
    ax.set_xlabel('num_medications')
    ax.set_ylabel('num_lab_procedures')
    ax.set_zlabel('time_in_hospital')
    ax.view_init(elev=elev, azim=azim)
plt.show()
model_performance_df = pd.DataFrame(list(model_performance.items()), columns=['Model', 'RMSE'])
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='RMSE', data=model_performance_df, palette='viridis')
plt.title('Regression Model Performance')
plt.ylabel('RMSE')
plt.xlabel('Model')
plt.xticks(rotation=45)
plt.show()
Above, we observe that the Gradient Boosting approach is the most effective, although it is closely trailed by the other methods, with the exception of the decision tree, which performs notably worse. As for the most influential parameters, we can confidently assert that they are 'num_medications' and 'num_lab_procedures'. Both relationships make sense: more lab procedures and more medications naturally go hand in hand with a longer hospital stay. If anything, it is striking how dominant the number of medications is, accounting for over half the importance in the Gradient Boosting model.
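The claimed relationship between these counts and length of stay can be checked directly with a correlation matrix. A sketch on synthetic data standing in for `dataForReg` (the distributions and coefficients below are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
num_lab = rng.poisson(40, n)    # lab procedure counts
num_meds = rng.poisson(15, n)   # medication counts
# Stay length loosely driven by both counts plus noise
stay = 1 + 0.1 * num_lab + 0.1 * num_meds + rng.normal(0, 1, n)
df = pd.DataFrame({"num_lab_procedures": num_lab,
                   "num_medications": num_meds,
                   "time_in_hospital": stay})
print(df.corr()["time_in_hospital"].round(2))  # both counts correlate positively
```

On the real data, `dataForReg[["num_lab_procedures", "num_medications", "time_in_hospital"]].corr()` would give the same kind of check.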
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X = dataForReg.drop("time_in_hospital", axis=1)
X = X.drop("health", axis=1)
X = pd.get_dummies(X)
y = dataForReg["time_in_hospital"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=0.95)),
('regressor', LinearRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE with PCA: {rmse}")
RMSE with PCA: 2.46990744845903
Here we can observe that PCA scores worse (RMSE 2.470) than plain linear regression without it (2.425). This suggests the principal components retained at 95% variance do not capture the predictive structure of the data well.
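When PCA underperforms like this, inspecting `explained_variance_ratio_` shows how the variance spreads across components (in the notebook's pipeline, the fitted step is reachable as `pipeline.named_steps['pca']`). A minimal sketch on synthetic data, where independent features force the variance to spread almost evenly:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 10))        # 10 independent features: no dominant direction
X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA(n_components=0.95).fit(X_scaled)  # keep enough components for 95% of the variance
print("Components kept:", pca.n_components_)
print("Cumulative variance:", pca.explained_variance_ratio_.sum().round(3))
```

With no dominant direction, almost all components must be kept to reach 95%, so PCA buys little; a similar diagnosis would explain the weak result above.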
# Reduced grid; a larger grid (n_estimators up to 200, learning_rate up to 0.2,
# max_depth up to 5, min_samples_split up to 10, min_samples_leaf up to 4) was
# also considered but is costly to search exhaustively.
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 4],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
#Gradient Boosting Regressor
gb_regressor = GradientBoostingRegressor(random_state=42)
# Create the GridSearchCV object
grid_search = GridSearchCV(
    estimator=gb_regressor,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',  # appropriate scoring metric for regression
    cv=5,
    n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train)
print("Best Parameters: ", grid_search.best_params_)
best_gb_model = grid_search.best_estimator_
y_pred_grid_search = best_gb_model.predict(X_test_scaled)
# Calculate RMSE
rmse_grid_search = np.sqrt(mean_squared_error(y_test, y_pred_grid_search))
print("RMSE with Grid Search: ", rmse_grid_search)
Best Parameters: {'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
RMSE with Grid Search: 2.321570509543093
Here we observe that hyperparameter tuning brings a modest improvement for Gradient Boosting: the RMSE drops from 2.339 with default parameters to 2.322.
We have now studied this dataset from the machine-learning point of view and can draw solid conclusions from the information we extracted.
With the obtained results, we can already make some projections that lean more towards preventive measures than pinpoint predictions. This shift from a purely reactive to a more preventive stance in healthcare could significantly change how we approach chronic diseases like diabetes. It emphasizes the importance of early intervention and continuous, personalized care to manage the disease more effectively.
In conclusion, while our study has made considerable progress in predicting patient readmission, it also highlights the need for further research. This research should aim to uncover the less visible variables affecting diabetes, improve measurement methods, and ultimately, guide us towards more effective and personalized healthcare solutions.